6 research outputs found
A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages
Modern large language models demonstrate impressive capabilities in text
generation and generalization. However, they often struggle with solving text
editing tasks, particularly when it comes to correcting spelling errors and
mistypings. In this paper, we present a methodology for generative spelling
correction (SC), which was tested on English and Russian languages and
potentially can be extended to any language with minor changes. Our research
mainly focuses on exploring natural spelling errors and mistypings in texts and
studying the ways those errors can be emulated in correct sentences to
effectively enrich generative models' pre-training procedure. We investigate the
impact of such emulations and the models' abilities across different text
domains. In this work, we examine two spelling corruption techniques: 1) the
first mimics human behavior when making a mistake by leveraging error
statistics from a particular dataset, and 2) the second adds the most common
spelling errors, keyboard missclicks, and some heuristics to the texts. We
conducted experiments employing various corruption strategies, model
architectures, and sizes at the pre-training and fine-tuning stages, and
evaluated the models using single-domain and multi-domain test sets. As a
practical outcome of our work, we introduce SAGE (Spell checking via
Augmentation and Generative distribution Emulation), a library for
automatic generative SC that includes a family of pre-trained generative models
and built-in augmentation algorithms.
Comment: to appear in EACL 202
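A minimal sketch of the second corruption technique (keyboard missclicks) described above; the adjacency map and corruption rate are illustrative assumptions, not the statistics or heuristics used in the paper or the SAGE API:
```python
import random

# Illustrative (truncated) QWERTY adjacency map; the paper's heuristics also
# cover common spelling errors, not only neighbouring keys.
KEY_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "f": "drtgvc",
    "e": "wsdr", "r": "edft", "t": "rfgy", "o": "iklp",
}

def corrupt(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Emulate keyboard missclicks by replacing a small fraction of
    characters with a neighbouring key."""
    rng = random.Random(seed)
    chars = list(text)
    for i, ch in enumerate(chars):
        neighbours = KEY_NEIGHBOURS.get(ch.lower())
        if neighbours and rng.random() < rate:
            chars[i] = rng.choice(neighbours)
    return "".join(chars)

print(corrupt("the quick brown fox jumps over the lazy dog", rate=0.15))
```
Corrupted sentences produced this way can be paired with their clean originals to enrich the training data of a generative corrector.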
DaNetQA: a yes/no Question Answering Dataset for the Russian Language
DaNetQA, a new question-answering corpus, follows the (Clark et al., 2019)
design: it comprises natural yes/no questions. Each question is paired with a
paragraph from Wikipedia and an answer, derived from the paragraph. The task is
to take both the question and a paragraph as input and come up with a yes/no
answer, i.e. to produce a binary output. In this paper, we present a
reproducible approach to DaNetQA creation and investigate transfer learning
methods for task and language transfer. For task transfer we leverage
three similar sentence modelling tasks: 1) a corpus of paraphrases,
Paraphraser, 2) an NLI task, for which we use the Russian part of XNLI, 3)
another question answering task, SberQUAD. For language transfer we use
English-to-Russian translation together with multilingual language fine-tuning.
Comment: Analysis of Images, Social Networks and Texts - 9th International
Conference, AIST 2020, Skolkovo, Russia, October 15-16, 2020, Revised
Selected Papers. Lecture Notes in Computer Science
(https://dblp.org/db/series/lncs/index.html), Springer 202
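A minimal sketch of the task format: the question and the Wikipedia paragraph are encoded as a single sequence pair and mapped to a binary yes/no label. The checkpoint name and the toy example are assumptions for illustration, not the exact setup from the paper, and the classification head would still need fine-tuning on DaNetQA:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any Russian or multilingual encoder can stand in here (assumed checkpoint).
MODEL = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

question = "Есть ли в Москве метро?"          # toy example pair
paragraph = "Московский метрополитен был открыт в 1935 году."

# Question and paragraph are encoded jointly as one sequence pair.
inputs = tokenizer(question, paragraph, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # untrained head: needs fine-tuning
print("yes" if logits.argmax(-1).item() == 1 else "no")
```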
mGPT: Few-Shot Learners Go Multilingual
Recent studies report that autoregressive language models can successfully
solve many NLP tasks via zero- and few-shot learning paradigms, which opens up
new possibilities for using pre-trained language models. This paper
introduces two autoregressive GPT-like models with 1.3 billion and 13 billion
parameters trained on 60 languages from 25 language families using Wikipedia
and the Colossal Clean Crawled Corpus. We reproduce the GPT-3 architecture using
GPT-2 sources and the sparse attention mechanism; the DeepSpeed and Megatron
frameworks allow us to parallelize the training and inference steps
effectively. The resulting models show performance on par with the recently
released XGLM models by Facebook, covering more languages and enhancing NLP
possibilities for low-resource languages of the CIS countries and the small
nations of Russia. We detail the motivation for the architecture design choices,
thoroughly describe the data preparation pipeline, and train five small
versions of the model to choose the optimal multilingual tokenization
strategy. We measure the model perplexity in all covered languages and evaluate
it on a wide spectrum of multilingual tasks, including classification,
generation, sequence labeling, and knowledge probing. The models were evaluated
with the zero-shot and few-shot methods. Furthermore, we compared performance on
the classification tasks with the state-of-the-art multilingual model XGLM. The
source code and the mGPT XL model are publicly released.
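A minimal sketch of the few-shot usage pattern with the released checkpoint; the Hub identifier "ai-forever/mGPT" and the sentiment prompt are assumptions for illustration, not results or setups reported in the paper:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed Hub identifier for the released 1.3B-parameter checkpoint.
MODEL = "ai-forever/mGPT"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Few-shot prompt: the task is expressed purely in text and the model
# continues the pattern without any gradient updates.
prompt = (
    "Отзыв: Отличный фильм, всем советую. Тональность: позитивная\n"
    "Отзыв: Скучно и затянуто. Тональность: негативная\n"
    "Отзыв: Прекрасная игра актёров. Тональность:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```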
A Family of Pretrained Transformer Language Models for Russian
Nowadays, Transformer language models (LMs) represent a fundamental component
of NLP research methodologies and applications. However, the development of
such models specifically for the Russian language has received little
attention. This paper presents a collection of 13 Russian Transformer LMs based
on the encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and
encoder-decoder (ruT5, FRED-T5) models in multiple sizes. Access to these
models is readily available via the HuggingFace platform. We provide a report
of the model architecture design and pretraining, and the results of evaluating
their generalization abilities on Russian natural language understanding and
generation datasets and benchmarks. By pretraining and releasing these
specialized Transformer LMs, we hope to broaden the scope of the NLP research
directions and enable the development of industrial solutions for the Russian
language.
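A minimal sketch of loading one of the released encoders from the HuggingFace platform; the repository name "ai-forever/ruBert-base" is an assumed identifier based on the family naming, and any checkpoint from the collection would serve the same purpose:
```python
from transformers import pipeline

# Assumed Hub identifier for one of the released encoder models.
fill_mask = pipeline("fill-mask", model="ai-forever/ruBert-base")
mask = fill_mask.tokenizer.mask_token

# Masked-token prediction with a Russian prompt ("Moscow is the capital of [MASK].").
for prediction in fill_mask(f"Москва - столица {mask}."):
    print(prediction["token_str"], round(prediction["score"], 3))
```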
Findings of the RuATD Shared Task 2022 on Artificial Text Detection in Russian
We present the shared task on artificial text detection in Russian, which is
organized as a part of the Dialogue Evaluation initiative, held in 2022. The
shared task dataset includes texts from 14 text generators, i.e., one human
writer and 13 generative text models fine-tuned for one or more of the
following generation tasks: machine translation, paraphrase generation, text
summarization, and text simplification. We also consider back-translation and
zero-shot generation approaches. The human-written texts are collected from
publicly available resources across multiple domains. The shared task consists
of two sub-tasks: (i) to determine if a given text is automatically generated
or written by a human; (ii) to identify the author of a given text. The first
task is framed as a binary classification problem. The second task is a
multi-class classification problem. We provide count-based and BERT-based
baselines, along with human evaluation on the first sub-task. A total of 30
and 8 systems were submitted to the binary and multi-class sub-tasks,
respectively. Most teams outperform the baselines by a wide margin. We
publicly release our codebase, human evaluation results, and other materials in
our GitHub repository (https://github.com/dialogue-evaluation/RuATD).
Comment: Accepted to Dialogue-2
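A minimal sketch of a count-based baseline for the binary sub-task (human vs. machine); the toy data and the feature choices are illustrative assumptions, and the actual baseline code lives in the shared-task repository linked above:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data; the real dataset pairs texts with human/machine labels.
texts = [
    "Сегодня мы гуляли в парке и кормили уток.",
    "Данный текст сгенерирован языковой моделью автоматически.",
]
labels = ["H", "M"]

# Character n-gram TF-IDF features + logistic regression: a typical
# count-based baseline for binary human-vs-machine classification.
baseline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["Текст написан человеком или моделью?"]))
```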